Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Li Zhang 206305918

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(gtsummary)
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Display your machine memory.

memuse::Sys.meminfo()
Totalram:    8.000 GiB 
Freeram:   402.250 MiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Patient of interest:

sid <- 10013310

Import ‘transfers.csv.gz’ as a tibble sid_adt.

sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 14 × 7
   subject_id  hadm_id transfer_id eventtype
        <dbl>    <dbl>       <dbl> <chr>    
 1   10013310 21243435    31696219 discharge
 2   10013310 21243435    31736720 ED       
 3   10013310 21243435    33511674 transfer 
 4   10013310 21243435    34848129 transfer 
 5   10013310 21243435    38910974 admit    
 6   10013310 22098926    31651850 transfer 
 7   10013310 22098926    32769810 admit    
 8   10013310 22098926    33278851 transfer 
 9   10013310 22098926    34063502 ED       
10   10013310 22098926    36029206 discharge
11   10013310 27682188    30077870 transfer 
12   10013310 27682188    30444898 discharge
13   10013310 27682188    31203589 admit    
14   10013310 27682188    35160955 ED       
   careunit                                        intime             
   <chr>                                           <dttm>             
 1 <NA>                                            2153-06-05 19:58:00
 2 Emergency Department                            2153-05-26 08:56:00
 3 Medicine/Cardiology                             2153-05-26 16:19:26
 4 Medicine/Cardiology                             2153-05-26 14:42:55
 5 Medicine/Cardiology                             2153-05-26 14:18:39
 6 Neuro Intermediate                              2153-06-12 16:31:33
 7 Neuro Surgical Intensive Care Unit (Neuro SICU) 2153-06-10 11:55:42
 8 Medicine                                        2153-06-16 19:03:14
 9 Emergency Department                            2153-06-10 10:40:00
10 <NA>                                            2153-07-21 18:02:28
11 Medicine/Cardiology                             2153-05-07 20:47:19
12 <NA>                                            2153-05-13 15:36:52
13 Coronary Care Unit (CCU)                        2153-05-06 18:28:00
14 Emergency Department                            2153-05-06 10:21:00
   outtime            
   <dttm>             
 1 NA                 
 2 2153-05-26 14:18:39
 3 2153-06-05 19:58:00
 4 2153-05-26 16:19:26
 5 2153-05-26 14:42:55
 6 2153-06-16 19:03:14
 7 2153-06-12 16:31:33
 8 2153-07-21 18:02:28
 9 2153-06-10 11:55:42
10 NA                 
11 2153-05-13 15:36:52
12 NA                 
13 2153-05-07 20:47:19
14 2153-05-06 18:28:00
data_labevents <- arrow::open_dataset("~/mimic/hosp/labevents.csv.gz", format = "csv")
arrow::write_dataset(data_labevents, "labevents.parquet")

Import ‘labevents.parquet’ as a tibble sid_lab.

sid_lab <- arrow::open_dataset("labevents.parquet") %>%
  dplyr::filter(subject_id == sid) %>%
  as_tibble() %>%
  print(width = Inf)
# A tibble: 2,285 × 16
   labevent_id subject_id hadm_id specimen_id itemid order_provider_id
         <int>      <int>   <int>       <int>  <int> <chr>            
 1      153564   10013310      NA     4841989  50887 ""               
 2      153565   10013310      NA     8958046  50934 ""               
 3      153566   10013310      NA     8958046  50947 ""               
 4      153567   10013310      NA     8958046  51003 ""               
 5      153568   10013310      NA     8958046  51678 ""               
 6      153569   10013310      NA    10682517  50933 ""               
 7      153570   10013310      NA    11713499  51133 ""               
 8      153571   10013310      NA    11713499  51146 ""               
 9      153572   10013310      NA    11713499  51200 ""               
10      153573   10013310      NA    11713499  51221 ""               
   charttime           storetime          
   <dttm>              <dttm>             
 1 2153-05-06 03:30:00 NA                 
 2 2153-05-06 03:30:00 2153-05-06 04:22:00
 3 2153-05-06 03:30:00 2153-05-06 04:22:00
 4 2153-05-06 03:30:00 2153-05-06 04:41:00
 5 2153-05-06 03:30:00 2153-05-06 04:22:00
 6 2153-05-06 03:30:00 NA                 
 7 2153-05-06 03:30:00 2153-05-06 04:09:00
 8 2153-05-06 03:30:00 2153-05-06 04:09:00
 9 2153-05-06 03:30:00 2153-05-06 04:09:00
10 2153-05-06 03:30:00 2153-05-06 04:09:00
   value                                    valuenum valueuom ref_range_lower
   <chr>                                       <dbl> <chr>              <dbl>
 1 HOLD.  DISCARD GREATER THAN 24 HRS OLD.     NA    ""                  NA  
 2 5                                            5    ""                  NA  
 3 2                                            2    ""                  NA  
 4 ___                                          2.97 "ng/mL"              0  
 5 14                                          14    ""                  NA  
 6 HOLD.  DISCARD GREATER THAN 4 HOURS OLD.    NA    ""                  NA  
 7 1.90                                         1.9  "K/uL"               1.2
 8 0.2                                          0.2  "%"                  0  
 9 0.1                                          0.1  "%"                  1  
10 32.5                                        32.5  "%"                 34  
   ref_range_upper flag       priority comments
             <dbl> <chr>      <chr>    <chr>   
 1           NA    ""         STAT     "___"   
 2           NA    ""         STAT     ""      
 3           NA    ""         STAT     ""      
 4            0.01 "abnormal" STAT     "___"   
 5           NA    ""         STAT     ""      
 6           NA    ""         STAT     "___"   
 7            3.7  ""         STAT     ""      
 8            1    ""         STAT     ""      
 9            7    "abnormal" STAT     ""      
10           45    "abnormal" STAT     ""      
# ℹ 2,275 more rows

Import ‘procedures_icd.csv.gz’ as a tibble sid_procedure.

sid_procedure <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |>
  filter(subject_id == sid) |>
  mutate(chartdate = as.POSIXct(chartdate)) |>
  print(width = Inf)
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 9 × 6
  subject_id  hadm_id seq_num chartdate           icd_code icd_version
       <dbl>    <dbl>   <dbl> <dttm>              <chr>          <dbl>
1   10013310 21243435       1 2153-05-27 00:00:00 4A023N7           10
2   10013310 21243435       2 2153-05-27 00:00:00 B2111ZZ           10
3   10013310 21243435       3 2153-05-27 00:00:00 B241ZZ3           10
4   10013310 22098926       1 2153-06-10 00:00:00 03CG3ZZ           10
5   10013310 22098926       2 2153-06-10 00:00:00 3E05317           10
6   10013310 22098926       3 2153-07-15 00:00:00 0DH63UZ           10
7   10013310 22098926       4 2153-06-11 00:00:00 3E0G76Z           10
8   10013310 27682188       1 2153-05-06 00:00:00 027034Z           10
9   10013310 27682188       2 2153-05-06 00:00:00 B211YZZ           10

Import ‘d_icd_procedures.csv.gz’ as d_icd_procedure.

d_icd_procedure <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz") |>
  print(width = Inf)
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 85,257 × 3
   icd_code icd_version
   <chr>          <dbl>
 1 0001               9
 2 0002               9
 3 0003               9
 4 0009               9
 5 001               10
 6 0010               9
 7 0011               9
 8 0012               9
 9 0013               9
10 0014               9
   long_title                                                 
   <chr>                                                      
 1 Therapeutic ultrasound of vessels of head and neck         
 2 Therapeutic ultrasound of heart                            
 3 Therapeutic ultrasound of peripheral vascular vessels      
 4 Other therapeutic ultrasound                               
 5 Central Nervous System and Cranial Nerves, Bypass          
 6 Implantation of chemotherapeutic agent                     
 7 Infusion of drotrecogin alfa (activated)                   
 8 Administration of inhaled nitric oxide                     
 9 Injection or infusion of nesiritide                        
10 Injection or infusion of oxazolidinone class of antibiotics
# ℹ 85,247 more rows
sid_procedure <- left_join(sid_procedure, 
                           d_icd_procedure, 
                           by = c("icd_code", "icd_version")) %>%
  select(subject_id, hadm_id, seq_num,
         chartdate, icd_code, icd_version, long_title)

Import ‘patients.csv.gz’ as a tibble sid_patient.

sid_patient <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10013310 F              70        2153 2017 - 2019       2153-11-19

Import ‘diagnoses_icd.csv.gz’ as a tibble sid_diagnoses.

sid_diagnoses <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 71 × 5
   subject_id  hadm_id seq_num icd_code icd_version
        <dbl>    <dbl>   <dbl> <chr>          <dbl>
 1   10013310 21243435       1 I222              10
 2   10013310 21243435       2 I5023             10
 3   10013310 21243435       3 I428              10
 4   10013310 21243435       4 E1142             10
 5   10013310 21243435       5 E1165             10
 6   10013310 21243435       6 I213              10
 7   10013310 21243435       7 I110              10
 8   10013310 21243435       8 I2510             10
 9   10013310 21243435       9 M25511            10
10   10013310 21243435      10 E785              10
# ℹ 61 more rows

Import ‘d_icd_diagnoses.csv.gz’ as d_icd_diagnoses.

d_icd_diagnoses <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz") |>
  print(width = Inf)
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 109,775 × 3
   icd_code icd_version long_title                           
   <chr>          <dbl> <chr>                                
 1 0010               9 Cholera due to vibrio cholerae       
 2 0011               9 Cholera due to vibrio cholerae el tor
 3 0019               9 Cholera, unspecified                 
 4 0020               9 Typhoid fever                        
 5 0021               9 Paratyphoid fever A                  
 6 0022               9 Paratyphoid fever B                  
 7 0023               9 Paratyphoid fever C                  
 8 0029               9 Paratyphoid fever, unspecified       
 9 0030               9 Salmonella gastroenteritis           
10 0031               9 Salmonella septicemia                
# ℹ 109,765 more rows
sid_diagnoses <- left_join(sid_diagnoses, 
                           d_icd_diagnoses, 
                           by = c("icd_code", "icd_version")) %>%
  select(subject_id, icd_code, icd_version, long_title)

Extract the top 3 diagnostics:

top_3_diagnoses <- sid_diagnoses %>%
  slice_head(n = 3) %>%
  pull(long_title)

subtitle_diagnoses <- paste(top_3_diagnoses, 
                            collapse = "\n")

Import ‘admissions.csv.gz’ as a tibble sid_admissions.

sid_admissions <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 16
  subject_id  hadm_id admittime           dischtime           deathtime
       <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
1   10013310 21243435 2153-05-26 14:18:00 2153-06-05 19:30:00 NA       
2   10013310 22098926 2153-06-10 11:55:00 2153-07-21 18:00:00 NA       
3   10013310 27682188 2153-05-06 18:03:00 2153-05-13 13:45:00 NA       
  admission_type    admit_provider_id admission_location       
  <chr>             <chr>             <chr>                    
1 OBSERVATION ADMIT P78TNY            INFORMATION NOT AVAILABLE
2 OBSERVATION ADMIT P09IS0            INFORMATION NOT AVAILABLE
3 URGENT            P89ZCW            TRANSFER FROM HOSPITAL   
  discharge_location       insurance language marital_status race         
  <chr>                    <chr>     <chr>    <chr>          <chr>        
1 HOME HEALTH CARE         Medicare  ?        SINGLE         BLACK/AFRICAN
2 SKILLED NURSING FACILITY Medicare  ?        SINGLE         BLACK/AFRICAN
3 HOME HEALTH CARE         Medicare  ?        SINGLE         BLACK/AFRICAN
  edregtime           edouttime           hospital_expire_flag
  <dttm>              <dttm>                             <dbl>
1 2153-05-26 08:56:00 2153-05-26 16:33:00                    0
2 2153-06-10 10:40:00 2153-06-10 11:25:00                    0
3 2153-05-06 10:21:00 2153-05-06 18:28:00                    0
sid_adt$source <- "ADT"
sid_lab$source <- "Lab"
sid_procedure$source <- "Procedure"

all_data <- bind_rows(sid_adt, sid_lab, sid_procedure)

Q1_plot <-
  ggplot(all_data) +
  geom_segment(
    data = filter(all_data, source == "ADT" & eventtype != "discharge"),
    aes(
      x = intime, xend = outtime,
      y = "ADT", yend = "ADT",
      color = careunit,
      linewidth = str_detect(careunit, "(ICU|CCU)")
    )
  ) +
  guides(linewidth = "none", shape = guide_legend(nrow = 5, ncol = 2)) +
  geom_point(
    data = filter(all_data, source == "Lab"),
    aes(x = charttime, y = "Lab"),
    shape = 3, size = 2
  ) +
  geom_point(
    data = filter(all_data, source == "Procedure"),
    aes(
      x = chartdate, y = "Procedure",
      shape = long_title
    ),
    position = position_jitter(height = 0.2),
    size = 3
  ) +
  labs(
    x = "", y = "",
    title = str_c("Patient ", sid, ", ", sid_patient$gender, ", ", sid_patient$anchor_age, " years old, ", tolower(sid_admissions$race)),
    subtitle = subtitle_diagnoses,
    shape = "Procedure", color = "Care Unit"
  ) +
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT")) +
  scale_shape_manual(values = c(1, 2, 3, 4, 5, 6, 7, 8, 9)) +
  theme(legend.position = "bottom", legend.box = "vertical", legend.key.size = unit(0.5, "cm"))

ggsave("q1_plot.png", plot = Q1_plot, width = 15, height = 6, dpi = 300)
Warning: Using linewidth for a discrete variable is not advised.

Here is my plot:

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Import ‘icustays.csv.gz’ as a tibble sid_icustays.

sid_icustays <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 8
  subject_id  hadm_id  stay_id first_careunit                                 
       <dbl>    <dbl>    <dbl> <chr>                                          
1   10013310 22098926 32769810 Neuro Surgical Intensive Care Unit (Neuro SICU)
2   10013310 27682188 31203589 Coronary Care Unit (CCU)                       
  last_careunit            intime              outtime               los
  <chr>                    <dttm>              <dttm>              <dbl>
1 Neuro Intermediate       2153-06-10 11:55:42 2153-06-16 19:03:14  6.30
2 Coronary Care Unit (CCU) 2153-05-06 18:28:00 2153-05-07 20:47:19  1.10

Import ‘chartevents.parquet’ as a tibble sid_chart.

data_chartevents <- arrow::open_dataset("~/mimic/icu/chartevents.csv.gz", format = "csv")
arrow::write_dataset(data_chartevents, "chartevents.parquet")
sid_chart <- arrow::open_dataset("chartevents.parquet") %>%
  dplyr::filter(subject_id == sid &
    itemid %in% c(220045, 220179, 220180, 223761, 220210)) %>%
  as_tibble() %>%
  print(width = Inf)
# A tibble: 549 × 11
   subject_id  hadm_id  stay_id caregiver_id charttime          
        <int>    <int>    <int>        <int> <dttm>             
 1   10013310 22098926 32769810        10285 2153-06-11 01:00:00
 2   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 3   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 4   10013310 22098926 32769810        10285 2153-06-11 02:02:00
 5   10013310 22098926 32769810        10285 2153-06-11 02:02:00
 6   10013310 22098926 32769810        10285 2153-06-12 00:00:00
 7   10013310 22098926 32769810        10285 2153-06-12 00:00:00
 8   10013310 22098926 32769810        10285 2153-06-12 00:03:00
 9   10013310 22098926 32769810        10285 2153-06-12 00:03:00
10   10013310 22098926 32769810        10285 2153-06-12 01:00:00
   storetime           itemid value valuenum valueuom warning
   <dttm>               <int> <chr>    <dbl> <chr>      <int>
 1 2153-06-11 02:25:00 223761 98.8      98.8 °F             0
 2 2153-06-11 02:25:00 220045 113      113   bpm            0
 3 2153-06-11 02:25:00 220210 26        26   insp/min       0
 4 2153-06-11 02:25:00 220179 131      131   mmHg           0
 5 2153-06-11 02:25:00 220180 62        62   mmHg           0
 6 2153-06-12 00:37:00 220045 121      121   bpm            0
 7 2153-06-12 00:37:00 220210 25        25   insp/min       0
 8 2153-06-12 00:37:00 220179 134      134   mmHg           0
 9 2153-06-12 00:37:00 220180 70        70   mmHg           0
10 2153-06-12 00:37:00 223761 99        99   °F             0
# ℹ 539 more rows

Import ‘d_items.csv.gz’ as a tibble d_items.

d_items <- read_csv("~/mimic/icu/d_items.csv.gz") |>
  print(width = Inf)
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4,014 × 9
   itemid label                               abbreviation       linksto       
    <dbl> <chr>                               <chr>              <chr>         
 1 220001 Problem List                        Problem List       chartevents   
 2 220003 ICU Admission date                  ICU Admission date datetimeevents
 3 220045 Heart Rate                          HR                 chartevents   
 4 220046 Heart rate Alarm - High             HR Alarm - High    chartevents   
 5 220047 Heart Rate Alarm - Low              HR Alarm - Low     chartevents   
 6 220048 Heart Rhythm                        Heart Rhythm       chartevents   
 7 220050 Arterial Blood Pressure systolic    ABPs               chartevents   
 8 220051 Arterial Blood Pressure diastolic   ABPd               chartevents   
 9 220052 Arterial Blood Pressure mean        ABPm               chartevents   
10 220056 Arterial Blood Pressure Alarm - Low ABP Alarm - Low    chartevents   
   category            unitname param_type    lownormalvalue highnormalvalue
   <chr>               <chr>    <chr>                  <dbl>           <dbl>
 1 General             <NA>     Text                      NA              NA
 2 ADT                 <NA>     Date and time             NA              NA
 3 Routine Vital Signs bpm      Numeric                   NA              NA
 4 Alarms              bpm      Numeric                   NA              NA
 5 Alarms              bpm      Numeric                   NA              NA
 6 Routine Vital Signs <NA>     Text                      NA              NA
 7 Routine Vital Signs mmHg     Numeric                   90             140
 8 Routine Vital Signs mmHg     Numeric                   60              90
 9 Routine Vital Signs mmHg     Numeric                   NA              NA
10 Alarms              mmHg     Numeric                   NA              NA
# ℹ 4,004 more rows
sid_chart <- left_join(sid_chart, d_items, by = c("itemid" = "itemid")) %>%
  select(subject_id, stay_id, charttime, itemid, abbreviation, valuenum)
Q2_plot <- ggplot(
  sid_chart,
  aes(
    x = charttime, y = valuenum,
    color = abbreviation, group = abbreviation
  )
) +
  geom_line() +
  geom_point() +
  facet_grid(abbreviation ~ stay_id, scales = "free") +
  labs(
    x = "", y = "", color = "Type of Vital",
    title = str_c("Patient ", sid, " ICU stays - Vitals")
  ) +
  guides(color = FALSE)
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ggsave("q2_plot.png",
  plot = Q2_plot,
  width = 15, height = 6, dpi = 300
)

Here is my plot:

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

icustays_tble %>%
  distinct(subject_id) %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1 50920
  • There are 50920 unique subject_id.

  • A subject_id can have multiple ICU stays:

icu_stays_summary <- icustays_tble %>%
  group_by(subject_id) %>%
  summarise(num_icu_stays = n())
icu_stays_summary
# A tibble: 50,920 × 2
   subject_id num_icu_stays
        <dbl>         <int>
 1   10000032             1
 2   10000980             1
 3   10001217             2
 4   10001725             1
 5   10001884             1
 6   10002013             1
 7   10002155             3
 8   10002348             1
 9   10002428             4
10   10002430             1
# ℹ 50,910 more rows
icu_stays_summary %>%
  ggplot(aes(x = num_icu_stays)) +
  geom_bar(fill = "skyblue", color = "black") +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  labs(
    title = "Number of ICU Stays per Patient",
    x = "Number of ICU Stays",
    y = "Count"
  )

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  print(width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows
ad_summary <- admissions_tble %>%
  group_by(subject_id) %>%
  summarise(num_hadm_id = n())

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

Number of admissions per patient

admissions_tble %>%
  group_by(subject_id) %>%
  summarise(num_hadm_id = n()) %>%
  ggplot(aes(x = num_hadm_id)) +
  geom_bar(fill = "skyblue", color = "black") +
  scale_x_continuous(breaks = seq(0, 250, by = 10)) +
  labs(
    title = "Number of Admissions per Patient",
    x = "Number of Admissions",
    y = "Count"
  )

mean(ad_summary$num_hadm_id)
[1] 2.386011
median(ad_summary$num_hadm_id)
[1] 1
sum(ad_summary$num_hadm_id == 1) / nrow(ad_summary)
[1] 0.5599309

Over a half of the patients have only one admission, but the outliers have more than 100 or even 200 admissions.

This situation is likely representative of a population where most people are generally healthy and require hospitalization only occasionally, while a smaller group deals with chronic illnesses or other issues leading to frequent hospital visits.

Admission hour

admissions_tble %>%
  ggplot(aes(x = hour(admittime))) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(
    title = "Admission Hour",
    x = "Hour",
    y = "Count"
  )

Admission hours exhibit a predominant concentration during the afternoon and evening.

There is a noticeable decline in admissions during the early morning hours.

Despite this trend, a notable peak is observed at 7 a.m which is unusual. I think this is due to the fact that the hospital staff are changing shifts at this time, and the new staff are admitting patients who have been waiting for a bed overnight.

Midnight is the peak of all time, indicating potential administrative protocols or a default timestamp for admissions occurring late at night.

Admission minute

admissions_tble %>%
  ggplot(aes(x = minute(admittime))) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(
    title = "Admission Minute",
    x = "Minute",
    y = "Count"
  )

Admission minutes exhibit a peak every 15 minutes and the peaks are decreasing as the minute increases.

Other miuntes are relatively evenly distributed.

This may be due to some certain procedures or administrative protocols.

Length of hospital stay (from admission to discharge)

admissions_tble %>%
  mutate(los = as.numeric(difftime(dischtime, admittime, units = "days"))) %>%
  ggplot(aes(x = los)) +
  geom_bar(fill = "skyblue", color = "black") +
  scale_x_continuous(breaks = seq(0, 250, by = 10)) +
  labs(
    title = "Length of Hospital Stay",
    x = "Length of Hospital Stay (days)",
    y = "Count"
  )

This distribution underscores the diversity in the duration of hospital stays, with most being short and only a minority being notably prolonged.

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  print(width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

library(patchwork)
gender_summary <- patients_tble %>%
  group_by(gender) %>%
  summarise(count = n())

# Bar plot for gender
Q4_plot1 <- ggplot(gender_summary, aes(x = gender, y = count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(
    title = "Gender Distribution",
    x = "Gender",
    y = "Count"
  )

# Histogram for anchor age
Q4_plot2 <- ggplot(patients_tble, aes(x = anchor_age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Anchor Age",
    x = "Anchor Age",
    y = "Frequency"
  ) +
  scale_x_continuous(breaks = seq(0, 100, by = 10))

Q4_plot1 + Q4_plot2

Females are slightly more than hald of the patients.

The peak appears at younger adults, and a decline as the age increases to middle age.

It starts increasing after 40 years old and reaches the second peak at around 55 years old.

As age increases, there is a corresponding decline in the patient count, aligning with the expectation that not all patients may survive into older age groups.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Import ‘d_labitems.csv.gz’ as a tibble d_labitems.

d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz") |>
  print(width = Inf)
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1,622 × 4
   itemid label                               fluid category 
    <dbl> <chr>                               <chr> <chr>    
 1  50801 Alveolar-arterial Gradient          Blood Blood Gas
 2  50802 Base Excess                         Blood Blood Gas
 3  50803 Calculated Bicarbonate, Whole Blood Blood Blood Gas
 4  50804 Calculated Total CO2                Blood Blood Gas
 5  50805 Carboxyhemoglobin                   Blood Blood Gas
 6  50806 Chloride, Whole Blood               Blood Blood Gas
 7  50808 Free Calcium                        Blood Blood Gas
 8  50809 Glucose                             Blood Blood Gas
 9  50810 Hematocrit, Calculated              Blood Blood Gas
10  50811 Hemoglobin                          Blood Blood Gas
# ℹ 1,612 more rows
labevents_tble <- arrow::open_dataset("labevents.parquet") %>%
  dplyr::filter(subject_id %in% icustays_tble$subject_id) %>%
  dplyr::filter(itemid %in% c(
    50912, 50971, 50983, 50902,
    50882, 51221, 51301, 50931
  )) %>%
  select(subject_id, itemid, storetime, valuenum) %>%
  collect() %>%
  left_join(icustays_tble, by = "subject_id") %>%
  select(subject_id, stay_id, itemid, valuenum, storetime, intime) %>%
  filter(storetime < intime) %>%
  group_by(subject_id, stay_id, itemid) %>%
  arrange(storetime, .by_group = TRUE) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  select(subject_id, stay_id, itemid, valuenum) %>%
  left_join(d_labitems, by = "itemid") %>%
  select(subject_id, stay_id, label, valuenum) %>%
  pivot_wider(names_from = label, values_from = valuenum) %>%
  print(width = Inf)
Warning in left_join(., icustays_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 845 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 68,467 × 10
   subject_id  stay_id Bicarbonate Chloride Creatinine Glucose Potassium Sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
   Hematocrit `White Blood Cells`
        <dbl>               <dbl>
 1       41.1                 6.9
 2       27.3                 5.3
 3       37.4                 5.4
 4       38.1                15.7
 5       NA                  NA  
 6       39.7                12.2
 7       34.9                 7.2
 8       25.5                17.9
 9       22.4                 9.8
10       39.7                 7.9
# ℹ 68,457 more rows

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Import ‘d_items.csv.gz’ as a tibble d_items.

d_items <- read_csv("~/mimic/icu/d_items.csv.gz") |>
  print(width = Inf)
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4,014 × 9
   itemid label                               abbreviation       linksto       
    <dbl> <chr>                               <chr>              <chr>         
 1 220001 Problem List                        Problem List       chartevents   
 2 220003 ICU Admission date                  ICU Admission date datetimeevents
 3 220045 Heart Rate                          HR                 chartevents   
 4 220046 Heart rate Alarm - High             HR Alarm - High    chartevents   
 5 220047 Heart Rate Alarm - Low              HR Alarm - Low     chartevents   
 6 220048 Heart Rhythm                        Heart Rhythm       chartevents   
 7 220050 Arterial Blood Pressure systolic    ABPs               chartevents   
 8 220051 Arterial Blood Pressure diastolic   ABPd               chartevents   
 9 220052 Arterial Blood Pressure mean        ABPm               chartevents   
10 220056 Arterial Blood Pressure Alarm - Low ABP Alarm - Low    chartevents   
   category            unitname param_type    lownormalvalue highnormalvalue
   <chr>               <chr>    <chr>                  <dbl>           <dbl>
 1 General             <NA>     Text                      NA              NA
 2 ADT                 <NA>     Date and time             NA              NA
 3 Routine Vital Signs bpm      Numeric                   NA              NA
 4 Alarms              bpm      Numeric                   NA              NA
 5 Alarms              bpm      Numeric                   NA              NA
 6 Routine Vital Signs <NA>     Text                      NA              NA
 7 Routine Vital Signs mmHg     Numeric                   90             140
 8 Routine Vital Signs mmHg     Numeric                   60              90
 9 Routine Vital Signs mmHg     Numeric                   NA              NA
10 Alarms              mmHg     Numeric                   NA              NA
# ℹ 4,004 more rows
chartevents_tble <- arrow::open_dataset("chartevents.parquet") %>%
  filter(itemid %in% c(220045, 220179, 220180, 223761, 220210)) %>%
  select(stay_id, charttime, itemid, valuenum) %>%
  collect()

join_tbl2 <- icustays_tble %>%
  select(subject_id, stay_id, intime) %>%
  left_join(chartevents_tble, by = "stay_id") %>%
  filter(charttime >= intime) %>%
  group_by(subject_id, stay_id, itemid) %>%
  slice_min(order_by = charttime) %>%
  ungroup()

chartevents_tble <- join_tbl2 %>%
  select(subject_id, stay_id, itemid, valuenum) %>%
  left_join(d_items, by = "itemid") %>%
  select(subject_id, stay_id, label, valuenum) %>%
  pivot_wider(names_from = label, values_from = valuenum, values_fn = mean) %>%
  print(width = Inf)
# A tibble: 73,164 × 7
   subject_id  stay_id `Heart Rate` `Non Invasive Blood Pressure systolic`
        <dbl>    <dbl>        <dbl>                                  <dbl>
 1   10000032 39553978           91                                     84
 2   10000980 39765666           77                                    150
 3   10001217 34592300           96                                    167
 4   10001217 37067082           86                                    151
 5   10001725 31205490           55                                     73
 6   10001884 37510196           38                                    180
 7   10002013 39060235           80                                    104
 8   10002155 31090461           94                                    118
 9   10002155 32358465           98                                    109
10   10002155 33685454           68                                    126
   `Non Invasive Blood Pressure diastolic` `Respiratory Rate`
                                     <dbl>              <dbl>
 1                                      48                 24
 2                                      77                 23
 3                                      95                 11
 4                                      90                 18
 5                                      56                 19
 6                                      12                 10
 7                                      70                 14
 8                                      51                 18
 9                                      65                 23
10                                      61                 18
   `Temperature Fahrenheit`
                      <dbl>
 1                     98.7
 2                     98  
 3                     97.6
 4                     98.5
 5                     97.7
 6                     98.1
 7                     97.2
 8                     96.9
 9                     97.7
10                     95.9
# ℹ 73,154 more rows

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

mimic_icu_cohort <-
  icustays_tble %>%
  left_join(admissions_tble, by = c("subject_id", "hadm_id")) %>%
  left_join(patients_tble, by = "subject_id") %>%
  mutate(intimeage = year(intime) - anchor_year + anchor_age) %>%
  filter(intimeage >= 18) %>%
  left_join(labevents_tble, by = c("subject_id", "stay_id")) %>%
  left_join(chartevents_tble, by = c("subject_id", "stay_id")) %>%
  print(width = Inf)
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los admittime           dischtime          
   <dttm>              <dbl> <dttm>              <dttm>             
 1 2180-07-23 23:50:47 0.410 2180-07-23 12:35:00 2180-07-25 17:55:00
 2 2189-06-27 20:38:27 0.498 2189-06-27 07:38:00 2189-07-03 03:00:00
 3 2157-11-21 22:08:00 1.12  2157-11-18 22:56:00 2157-11-25 18:00:00
 4 2157-12-20 14:27:41 0.948 2157-12-18 16:58:00 2157-12-24 14:55:00
 5 2110-04-12 23:59:56 1.34  2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20 08:27:30 9.17  2131-01-07 20:39:00 2131-01-20 05:15:00
 7 2160-05-19 17:33:33 1.31  2160-05-18 07:45:00 2160-05-23 13:30:00
 8 2131-03-10 18:09:21 0.859 2131-03-09 20:33:00 2131-03-10 01:55:00
 9 2129-08-10 17:02:38 6.18  2129-08-04 12:44:00 2129-08-18 16:53:00
10 2130-09-27 22:13:41 3.89  2130-09-23 21:59:00 2130-09-29 18:55:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  EW EMER.                    P4645A           
 4 NA                  DIRECT EMER.                P99698           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-20 05:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 2131-03-10 21:53:00 EW EMER.                    P80515           
 9 NA                  EW EMER.                    P05HUO           
10 NA                  EW EMER.                    P3529J           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 4 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     DIED                         Other     ENGLISH 
 9 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
10 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-23 05:54:00 2180-07-23 14:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-27 06:25:00 2189-06-27 08:42:00
 3 MARRIED        WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 4 MARRIED        WHITE                  NA                  NA                 
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 13:36:00 2131-01-07 22:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
 9 MARRIED        WHITE                  2129-08-04 11:00:00 2129-08-04 12:35:00
10 MARRIED        WHITE                  2130-09-23 19:59:00 2130-09-24 00:50:00
   hospital_expire_flag gender anchor_age anchor_year anchor_year_group
                  <dbl> <chr>       <dbl>       <dbl> <chr>            
 1                    0 F              52        2180 2014 - 2016      
 2                    0 F              73        2186 2008 - 2010      
 3                    0 F              55        2157 2011 - 2013      
 4                    0 F              55        2157 2011 - 2013      
 5                    0 F              46        2110 2011 - 2013      
 6                    1 F              68        2122 2008 - 2010      
 7                    0 F              53        2156 2008 - 2010      
 8                    1 F              80        2128 2008 - 2010      
 9                    0 F              80        2128 2008 - 2010      
10                    0 F              80        2128 2008 - 2010      
   dod        intimeage Bicarbonate Chloride Creatinine Glucose Potassium Sodium
   <date>         <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1 2180-09-09        52          25       95        0.7     102       6.7    126
 2 2193-08-26        76          21      109        2.3      89       3.9    144
 3 NA                55          22      108        0.6     112       4.2    142
 4 NA                55          30      104        0.5      87       4.1    142
 5 NA                46          NA       98       NA        NA       4.1    139
 6 2131-01-20        77          30       88        1.1     141       4.5    130
 7 NA                57          24      102        0.9     288       3.5    137
 8 2131-03-10        83          26       85        1.4     133       5.7    120
 9 2131-03-10        81          24      105        1.1     138       4.6    139
10 2131-03-10        82          23       98        2.8     117       4.9    135
   Hematocrit `White Blood Cells` `Heart Rate`
        <dbl>               <dbl>        <dbl>
 1       41.1                 6.9           91
 2       27.3                 5.3           77
 3       38.1                15.7           86
 4       37.4                 5.4           96
 5       NA                  NA             55
 6       39.7                12.2           38
 7       34.9                 7.2           80
 8       22.4                 9.8           98
 9       39.7                 7.9           68
10       25.5                17.9           94
   `Non Invasive Blood Pressure systolic`
                                    <dbl>
 1                                     84
 2                                    150
 3                                    151
 4                                    167
 5                                     73
 6                                    180
 7                                    104
 8                                    109
 9                                    126
10                                    118
   `Non Invasive Blood Pressure diastolic` `Respiratory Rate`
                                     <dbl>              <dbl>
 1                                      48                 24
 2                                      77                 23
 3                                      90                 18
 4                                      95                 11
 5                                      56                 19
 6                                      12                 10
 7                                      70                 14
 8                                      65                 23
 9                                      61                 18
10                                      51                 18
   `Temperature Fahrenheit`
                      <dbl>
 1                     98.7
 2                     98  
 3                     98.5
 4                     97.6
 5                     97.7
 6                     98.1
 7                     97.2
 8                     97.7
 9                     95.9
10                     96.9
# ℹ 73,171 more rows

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

mimic_icu_cohort %>%
  group_by(race) %>%
  summarise(
    mean_los = mean(los),
    median_los = median(los),
    sum_los = sum(los),
    sd_los = sd(los)
  ) %>%
  print(width = Inf) %>%
  ggplot(aes(x = sum_los, y = race)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Total Length of ICU stay vs Race", x = "", y = "")
# A tibble: 33 × 5
   race                          mean_los median_los sum_los sd_los
   <chr>                            <dbl>      <dbl>   <dbl>  <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE     4.46       2.09    624.   6.54
 2 ASIAN                             3.49       1.93   2928.   4.90
 3 ASIAN - ASIAN INDIAN              4.20       1.95    752.   7.61
 4 ASIAN - CHINESE                   3.36       1.86   2660.   5.39
 5 ASIAN - KOREAN                    4.23       2.08    216.   6.35
 6 ASIAN - SOUTH EAST ASIAN          3.04       1.82    890.   5.42
 7 BLACK/AFRICAN                     3.82       2.02   1179.   5.90
 8 BLACK/AFRICAN AMERICAN            3.28       1.85  22024.   4.80
 9 BLACK/CAPE VERDEAN                3.22       1.70   1614.   4.94
10 BLACK/CARIBBEAN ISLAND            3.61       1.95   1538.   5.13
# ℹ 23 more rows

ggplot(mimic_icu_cohort, aes(x = los, y = race)) +
  geom_boxplot() +
  labs(title = "Length of ICU stay vs Demographic Variables", x = "Race", y = "Length of ICU Stay") +
  scale_x_continuous(breaks = seq(0, max(mimic_icu_cohort$los), by = 3))

The length of ICU stay shows little variation across different races, typically within a day.

The White people had the longest total ICU stays and the second is African American. This is consistent with the fact that the White people are the majority of the population in the US.

mimic_icu_cohort %>%
  group_by(insurance) %>%
  summarise(
    mean_los = mean(los),
    median_los = median(los),
    sum_los = sum(los),
    sd_los = sd(los)
  ) %>%
  print(width = Inf)
# A tibble: 3 × 5
  insurance mean_los median_los sum_los sd_los
  <chr>        <dbl>      <dbl>   <dbl>  <dbl>
1 Medicaid      3.46       1.81  19149.   5.18
2 Medicare      3.48       2.01 115264.   4.79
3 Other         3.42       1.87 118203.   5.00
ggplot(
  mimic_icu_cohort,
  aes(x = los, y = insurance)
) +
  geom_boxplot() +
  labs(
    title = "Length of ICU stay vs Insurance Type",
    x = "Length of ICU Stay", y = ""
  ) +
  scale_x_continuous(breaks = seq(0, max(mimic_icu_cohort$los),
    by = 2
  ))

ggplot(mimic_icu_cohort, 
       aes(x = insurance, y = los)) +
  geom_bar(stat = "identity", 
           color = "skyblue") +
  labs(title = "Total Length of ICU stay vs Insurance Type",
       x = "Insurance", y = "Total Length of ICU Stay")

It seems that the length of ICU stay is similar across different insurance types.

Patients with Medicare have longest total ICU stays than those with Medicaid which may suggest Medicare is more likely to be used by patients.

Patients with ‘Other’ insurance type have the longest total ICU stays which may suggest a need for further investigation and more specific categorization.

mimic_icu_cohort %>%
  filter(!is.na(marital_status)) %>%
  group_by(marital_status) %>%
  summarise(
    mean_los = mean(los, na.rm = TRUE),
    median_los = median(los, na.rm = TRUE),
    min_los = min(los, na.rm = TRUE),
    max_los = max(los, na.rm = TRUE),
    sum_los = sum(los, na.rm = TRUE),
    sd_los = sd(los, na.rm = TRUE)
  )
# A tibble: 4 × 7
  marital_status mean_los median_los min_los max_los sum_los sd_los
  <chr>             <dbl>      <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 DIVORCED           3.39       1.90 0.0075     91.0  18295.   4.74
2 MARRIED            3.46       1.94 0.00125   110.  113223.   5.00
3 SINGLE             3.40       1.87 0.00227    95.8  70957.   4.92
4 WIDOWED            3.13       1.93 0.00733    86.3  28251.   4.10
ggplot(
  mimic_icu_cohort %>%
    filter(!is.na(marital_status)),
  aes(x = los, y = marital_status)
) +
  geom_boxplot() +
  labs(
    title = "Length of ICU stay vs Marital Status",
    x = "Length of ICU Stay", y = ""
  ) +
  scale_x_continuous(breaks = seq(0, max(mimic_icu_cohort$los),
    by = 2
  ))

ggplot(mimic_icu_cohort %>% 
         filter(!is.na(marital_status)), 
       aes(x = marital_status, y = los)) +
  geom_bar(stat = "identity", color = "skyblue") +
  labs(title = "Total Length of ICU stay vs Marital Status", 
       x = "Marital Status", y = "Total Length of ICU Stay")

It seems not much difference in the length of ICU stay across different marital status.

The married patients have a slightly wider distribution of ICU stays and more outliers than the other groups.

mimic_icu_cohort %>%
  filter(!is.na(gender)) %>%
  group_by(gender) %>%
  summarise(
    mean_los = mean(los, na.rm = TRUE),
    median_los = median(los, na.rm = TRUE),
    min_los = min(los, na.rm = TRUE),
    max_los = max(los, na.rm = TRUE),
    sum_los = sum(los, na.rm = TRUE),
    sd_los = sd(los, na.rm = TRUE)
  )
# A tibble: 2 × 7
  gender mean_los median_los min_los max_los sum_los sd_los
  <chr>     <dbl>      <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 F          3.34       1.91 0.00145    110. 108180.   4.70
2 M          3.54       1.94 0.00125    103. 144435.   5.09
ggplot(
  mimic_icu_cohort %>% filter(!is.na(gender)),
  aes(x = los, y = gender)
) +
  geom_boxplot() +
  labs(
    title = "Length of ICU stay vs Gender",
    x = "Length of ICU Stay", y = ""
  ) +
  scale_x_continuous(breaks = seq(0, max(mimic_icu_cohort$los),
    by = 2
  ))

ggplot(mimic_icu_cohort %>%
  filter(!is.na(gender)), aes(x = gender, y = los)) +
  geom_bar(stat = "identity", color = "skyblue") +
  labs(
    title = "Total Length of ICU stay vs Gender",
    x = "Gender", y = "Total Length of ICU Stay"
  )

Male have a slightly wider distribution of ICU stays and more outliers than female.

Male patients also have longer total ICU stays.

mimic_icu_cohort %>%
  filter(!is.na(intimeage)) %>%
  group_by(intimeage) %>%
  summarise(
    mean_los = mean(los),
    median_los = median(los),
    sum_los = sum(los),
    sd_los = sd(los)
  ) %>%
  print(width = Inf)
# A tibble: 84 × 5
   intimeage mean_los median_los sum_los sd_los
       <dbl>    <dbl>      <dbl>   <dbl>  <dbl>
 1        18     1.91       1.23    161.   2.02
 2        19     3.17       1.71    419.   5.03
 3        20     2.61       1.33    641.   3.45
 4        21     2.81       1.50    676.   5.90
 5        22     2.70       1.48    680.   3.79
 6        23     3.09       1.57    841.   4.18
 7        24     3.10       1.74    725.   4.56
 8        25     3.35       1.43    924.   6.50
 9        26     2.86       1.65    899.   4.21
10        27     4.61       1.91   1305.   8.95
# ℹ 74 more rows
ggplot(mimic_icu_cohort %>%
  filter(!is.na(intimeage)), aes(x = intimeage, y = los)) +
  geom_smooth() +
  labs(
    title = "Length of ICU stay vs Age at Intime",
    x = "Age at Intime", y = "Length of ICU Stay"
  )
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The graph shows a steady increase in ICU stay length up to around age 75, followed by a sharp decline.

Length of ICU stay los vs the last available lab measurements before ICU stay

mimic_icu_cohort %>%
  select(
    Bicarbonate, Chloride, Creatinine,
    Glucose, Potassium, Sodium,
    Hematocrit, `White Blood Cells`, los
  ) %>%
  pivot_longer(
    cols = -los,
    names_to = "abbreviation",
    values_to = "value"
  ) %>%
  ggplot(aes(
    x = los, y = value,
    color = abbreviation,
    group = abbreviation
  )) +
  geom_point() +
  facet_grid(abbreviation ~ ., scales = "free") +
  labs(
    x = "Length of ICU Stay", y = "",
    color = "Type of Measurement",
    title = "ICU Measurements over Length of ICU Stay"
  ) +
  guides(color = FALSE) +
  scale_x_continuous(breaks = seq(0, max(mimic_icu_cohort$los),
    by = 5
  ))
Warning: Removed 60686 rows containing missing values (`geom_point()`).

Most data points are concentrated at lower levels of the last available lab measurements before ICU stay, and correspond to shorter lengths of ICU stay.

This indicates that the majority of patients have lower levels of the last available lab measurements before ICU stay, and require shorter ICU stays.

However, in some cases of abnormal lab measurements, we observe some exceptionally long ICU stays, which may be outliers.

These outliers may represent patients with specific conditions leading to abnormal lab measurements and prolonged ICU stays.

Length of ICU stay los vs the first vital measurements within the ICU stay

mimic_icu_cohort %>%
  select(
    "Heart Rate", "Non Invasive Blood Pressure systolic",
    "Non Invasive Blood Pressure diastolic",
    "Respiratory Rate",
    "Temperature Fahrenheit", los
  ) %>%
  pivot_longer(
    cols = -los,
    names_to = "abbreviation",
    values_to = "value"
  ) %>%
  ggplot(aes(
    x = los, y = value,
    color = abbreviation,
    group = abbreviation
  )) +
  geom_point() +
  facet_grid(abbreviation ~ ., scales = "free") +
  labs(
    x = "Length of ICU Stay", y = "",
    color = "Type of Measurement",
    title = "los vs the first vital measurements within the ICU stay"
  ) +
  guides(color = FALSE) +
  scale_x_continuous(breaks = seq(0, max(mimic_icu_cohort$los),
    by = 5
  ))
Warning: Removed 3410 rows containing missing values (`geom_point()`).

Patients with lower heart rates typically have shorter ICU hospital stays, but individual patients may experience abnormally high heart rates due to specific clinical conditions, leading to longer hospital stays.

Most patients have blood pressure within the normal range, corresponding to shorter ICU hospitalization time. However, extremely high blood pressure may reflect serious medical conditions and further confirmation of accuracy is needed.

Mild increase in respiratory rate is usually associated with shorter ICU hospitalization time, but abnormally high respiratory rate may be associated with more severe disease conditions that require longer ICU treatment.

Normal or slightly elevated body temperature is usually associated with shorter ICU hospitalization time, while abnormally high body temperature may be associated with more severe disease conditions, leading to longer ICU monitoring time.

  • In order to reduce the impact of outliers on the overall graph and better display the distribution of data,

I use a log scale for the y-axis.

mimic_icu_cohort %>%
  select(
    "Heart Rate", "Non Invasive Blood Pressure systolic",
    "Non Invasive Blood Pressure diastolic",
    "Respiratory Rate", "Temperature Fahrenheit", los
  ) %>%
  pivot_longer(
    cols = -los, names_to = "abbreviation",
    values_to = "value"
  ) %>%
  ggplot(aes(
    x = los, y = value,
    color = abbreviation, group = abbreviation
  )) +
  geom_point() +
  facet_grid(abbreviation ~ ., scales = "free") +
  labs(
    x = "Length of ICU Stay", y = "",
    color = "Type of Measurement",
    title = "los vs the first vital measurements within the ICU stay(log10)"
  ) +
  guides(color = FALSE) +
  scale_x_continuous(breaks = seq(0, max(mimic_icu_cohort$los), by = 5)) +
  scale_y_log10()
Warning: Transformation introduced infinite values in continuous y-axis
Warning: Removed 3410 rows containing missing values (`geom_point()`).

###Length of ICU stay los vs first ICU unit

mimic_icu_cohort %>%
  filter(!is.na(first_careunit)) %>%
  group_by(first_careunit) %>%
  summarise(
    mean_los = mean(los, na.rm = TRUE),
    median_los = median(los, na.rm = TRUE),
    min_los = min(los, na.rm = TRUE),
    max_los = max(los, na.rm = TRUE),
    sum_los = sum(los, na.rm = TRUE),
    sd_los = sd(los, na.rm = TRUE)
  )
# A tibble: 9 × 7
  first_careunit              mean_los median_los min_los max_los sum_los sd_los
  <chr>                          <dbl>      <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Cardiac Vascular Intensive…     3.29       1.99 0.00352   103.   38105.   4.89
2 Coronary Care Unit (CCU)        3.19       2.01 0.00125    66.6  26532.   4.00
3 Medical Intensive Care Uni…     3.26       1.83 0.00227   110.   51882.   4.61
4 Medical/Surgical Intensive…     3.08       1.81 0.00145    67.9  39279.   4.30
5 Neuro Intermediate              3.42       2.34 0.00346    43.5   6961.   3.73
6 Neuro Stepdown                  2.59       1.69 0.0570     29.1   2609.   2.89
7 Neuro Surgical Intensive C…     6.30       3.65 0.0216     59.6  11099.   7.46
8 Surgical Intensive Care Un…     3.84       1.97 0.00745   102.   42832.   5.76
9 Trauma SICU (TSICU)             3.83       1.94 0.00218    99.6  33317.   5.52
ggplot(mimic_icu_cohort, aes(x = los, y = first_careunit)) +
  geom_boxplot() +
  labs(title = "Length of ICU stay vs first_careunite",
       x = "Length of ICU Stay", y = "") +
  scale_x_continuous(
    breaks = seq(0, max(mimic_icu_cohort$los), by = 2))

  • Neuro Surgical Intensive Care Unit (Neuro SICU) has longer average ICU stay and larger variation than other units .

This may be due to the fact that patients in Neuro SICU are more likely to have severe conditions and require longer ICU stays.

  • Outliers in units like Trauma SICU, Surgical ICU, Medical ICU, and Cardiac Vascular ICU have notably longer ICU stays, sometimes exceeding 100 days—twice the maximum observed in Neuro Intermediate and three times that of Neuro Stepdown units.

This likely reflects the complexity and severity of conditions treated in these specialized units.